📌 Title¶
Exploring Traffic Patterns on I-94: A Data-Driven Approach¶
📝 Project Description¶
In this project, I explored a dataset containing hourly traffic data from the I-94 Interstate to understand what factors influence heavy traffic. My goal was to identify patterns based on weather, time of day, and day of the week — and to practice building insights through exploratory data analysis.
The traffic data focuses on the westbound direction, from Saint Paul to Minneapolis.
🎯 Objective¶
I wanted to investigate whether traffic volume is affected by external factors like seasonal changes (e.g., summer vs. winter) and weather conditions (e.g., snow or rain).
The I-94 Traffic Dataset¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'notebook_connected' # or 'iframe_connected' if you want isolation
i_94_traffic = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
i_94_traffic.head(5)
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
| 1 | NaN | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
| 2 | NaN | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
| 3 | NaN | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
| 4 | NaN | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
i_94_traffic.tail(5)
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
|---|---|---|---|---|---|---|---|---|---|
| 48199 | NaN | 283.45 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2018-09-30 19:00:00 | 3543 |
| 48200 | NaN | 282.76 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 20:00:00 | 2781 |
| 48201 | NaN | 282.73 | 0.0 | 0.0 | 90 | Thunderstorm | proximity thunderstorm | 2018-09-30 21:00:00 | 2159 |
| 48202 | NaN | 282.09 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 22:00:00 | 1450 |
| 48203 | NaN | 282.12 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 23:00:00 | 954 |
i_94_traffic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   holiday              61 non-null     object
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64
 5   weather_main         48204 non-null  object
 6   weather_description  48204 non-null  object
 7   date_time            48204 non-null  object
 8   traffic_volume       48204 non-null  int64
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB
📂 Dataset Overview¶
The dataset contains 48,204 rows and 9 columns. Every column is complete except holiday, which is populated in only 61 rows. Each row captures weather and traffic data for a specific hour.
The time range spans from 2012-10-02 09:00:00 to 2018-09-30 23:00:00.
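The overview figures above can also be checked in code rather than read off the `info()` output. The following is a minimal sketch, assuming a hypothetical helper named `summarize_frame` (it is not part of the original notebook) and a three-row stand-in built from the `head()` output instead of the full CSV.

```python
import pandas as pd

def summarize_frame(df, time_col="date_time"):
    """Return row count, time span, and per-column missing counts.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    times = pd.to_datetime(df[time_col])
    return {
        "rows": len(df),
        "start": times.min(),
        "end": times.max(),
        "missing": df.isna().sum().to_dict(),
    }

# Small stand-in for the real dataset (values copied from the head() output)
sample = pd.DataFrame({
    "holiday": [None, None, None],
    "temp": [288.28, 289.36, 289.58],
    "date_time": ["2012-10-02 09:00:00", "2012-10-02 10:00:00", "2012-10-02 11:00:00"],
    "traffic_volume": [5545, 4516, 4767],
})

summary = summarize_frame(sample)
print(summary)
```

Running the same helper on `i_94_traffic` should reproduce the row count, the date range, and the missing-value count for holiday reported above.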
🚦 Understanding Traffic Volume¶
fig = px.histogram(i_94_traffic, x='traffic_volume', nbins=10, text_auto=True, width=600,
                   labels={'traffic_volume': 'Traffic Volume', 'count': 'Count'},
                   title='Traffic Volume Distribution')
fig.show()
i_94_traffic['traffic_volume'].describe()
count    48204.000000
mean      3259.818355
std       1986.860670
min          0.000000
25%       1193.000000
50%       3380.000000
75%       4933.000000
max       7280.000000
Name: traffic_volume, dtype: float64
To begin, I visualized the distribution of the traffic_volume column using a histogram. Here's a quick statistical summary:
Minimum: 0
Maximum: 7,280
Mean: ~3,260
25th percentile: ~1,193
75th percentile: ~4,933
This tells us that traffic volume varies a lot, and it seems like there are distinct periods of low and high volume. That led me to investigate how traffic differs between daytime and nighttime.
🌙 Day vs. Night Traffic¶
I split the data into:
Daytime: 07:00 to 19:00 (7 AM to 7 PM)
Nighttime: 19:00 to 07:00 (7 PM to 7 AM)
i_94_traffic['date_time'] = pd.to_datetime(i_94_traffic['date_time'])
# Filter first, then copy, so later column assignments act on independent frames
hours = i_94_traffic['date_time'].dt.hour
daytime_traffic = i_94_traffic[(hours >= 7) & (hours < 19)].copy()
nighttime_traffic = i_94_traffic[(hours >= 19) | (hours < 7)].copy()
print(f"Day Time Shape: {daytime_traffic.shape},\nNight Time Shape: {nighttime_traffic.shape}")
Day Time Shape: (23877, 9),
Night Time Shape: (24327, 9)
i_94_traffic.iloc[176:178]
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
|---|---|---|---|---|---|---|---|---|---|
| 176 | NaN | 281.17 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-10 03:00:00 | 361 |
| 177 | NaN | 281.25 | 0.0 | 0.0 | 92 | Clear | sky is clear | 2012-10-10 06:00:00 | 5875 |
Daytime rows: 23,877
Nighttime rows: 24,327
Both windows cover 12 hours, so the two counts should be nearly equal. The difference of 450 rows comes from irregularities in the data rather than from the split itself: the hourly records are not continuous.
For example, rows 176 and 177 above jump from 03:00 to 06:00 on 2012-10-10, so no data was collected for two hours (04:00 and 05:00).
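Gaps like the one between rows 176 and 177 can be located by differencing consecutive timestamps. The sketch below is illustrative only: `find_gaps` is a hypothetical helper, it assumes one reading per hour, and it is run on timestamps copied from the tables above rather than on the full dataset.

```python
import pandas as pd

def find_gaps(timestamps, expected="1h"):
    """Return (previous, next, missing_steps) for every jump larger than `expected`.

    Hypothetical helper for illustration; assumes one reading per hour.
    """
    ts = pd.Series(pd.to_datetime(timestamps)).sort_values().reset_index(drop=True)
    step = ts.diff()  # time elapsed since the previous record
    gaps = []
    for i in step[step > pd.Timedelta(expected)].index:
        missing = int(step[i] / pd.Timedelta(expected)) - 1
        gaps.append((ts[i - 1], ts[i], missing))
    return gaps

# Consecutive hours from the head() output: no gap expected
head_hours = ["2012-10-02 09:00:00", "2012-10-02 10:00:00", "2012-10-02 11:00:00",
              "2012-10-02 12:00:00", "2012-10-02 13:00:00"]
print(find_gaps(head_hours))

# Rows 176 and 177: the 04:00 and 05:00 records are absent
gaps = find_gaps(["2012-10-10 03:00:00", "2012-10-10 06:00:00"])
print(gaps)
```

Calling `find_gaps(i_94_traffic['date_time'])` would list every such jump, which gives a rough sense of how much of the row-count difference the missing hours account for.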
📊 Traffic Volume Distribution¶
fig = make_subplots(rows=1, cols=2, column_widths=[0.5, 0.5])
traffic_volume_day_plot = go.Histogram(x=daytime_traffic['traffic_volume'], nbinsx=10, name='Day')
traffic_volume_night_plot = go.Histogram(x=nighttime_traffic['traffic_volume'], nbinsx=10, name='Night')
fig.add_trace(traffic_volume_day_plot, row=1, col=1)
fig.add_trace(traffic_volume_night_plot, row=1, col=2)
fig.update_layout(
    xaxis1_title_text='Traffic Volume',
    yaxis1_title_text='Frequency',
    xaxis2_title_text='Traffic Volume',
    yaxis2_title_text='Frequency',
    title={'text': "Traffic Volume Comparison: Day Vs Night", 'x': 0.5},
    width=1000
)
fig.show()
daytime_traffic['traffic_volume'].describe()
count    23877.000000
mean      4762.047452
std       1174.546482
min          0.000000
25%       4252.000000
50%       4820.000000
75%       5559.000000
max       7280.000000
Name: traffic_volume, dtype: float64
nighttime_traffic['traffic_volume'].describe()
count    24327.000000
mean      1785.377441
std       1441.951197
min          0.000000
25%        530.000000
50%       1287.000000
75%       2819.000000
max       6386.000000
Name: traffic_volume, dtype: float64
From the histograms:
Daytime traffic is mostly higher, with a peak around 4,000–5,000
Nighttime traffic tends to be lower, with many hours under 2,000
The average hourly volume during the day is about 4,762, while at night it drops to around 1,785.
This confirmed my hypothesis: traffic is significantly heavier during the day.
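The same comparison can be made without building two separate frames, by labelling each row and grouping on the label. This is a small sketch of that alternative, not the approach used in the notebook: `mean_by_period` is a hypothetical helper, and the sample rows are copied from the tables shown earlier.

```python
import pandas as pd

def mean_by_period(df, start_hour=7, end_hour=19, time_col="date_time"):
    """Label each row 'day' or 'night' and return the mean traffic per label.

    Hypothetical helper; the 07:00 to 19:00 window mirrors the split used above.
    """
    hours = pd.to_datetime(df[time_col]).dt.hour
    is_day = (hours >= start_hour) & (hours < end_hour)
    period = is_day.map({True: "day", False: "night"})
    return df.groupby(period)["traffic_volume"].mean()

# Rows copied from the head(), tail(), and iloc outputs above
sample = pd.DataFrame({
    "date_time": ["2012-10-02 09:00:00", "2012-10-02 10:00:00",
                  "2018-09-30 22:00:00", "2018-09-30 23:00:00",
                  "2012-10-10 03:00:00"],
    "traffic_volume": [5545, 4516, 1450, 954, 361],
})

means = mean_by_period(sample)
print(means)
```

Keeping a single frame with a label column makes it easy to try other window boundaries by changing `start_hour` and `end_hour`.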
🕒 Time-Based Patterns¶
Next, I explored how traffic changes across different time dimensions — starting with month, then day of the week, and finally hour of the day.
📅 By Month¶
daytime_traffic['month'] = daytime_traffic['date_time'].dt.month
daytime_traffic_by_month = daytime_traffic.groupby('month').mean(numeric_only=True)
fig = px.line(daytime_traffic_by_month,y=daytime_traffic_by_month['traffic_volume'],x=daytime_traffic_by_month.index, width=600,title="Month Indicator Plot")
fig.update_traces(textposition="bottom right")
fig.show()
daytime_traffic['year'] = daytime_traffic['date_time'].dt.year
daytime_traffic_in_july = daytime_traffic[daytime_traffic['month'] == 7 ]
daytime_traffic_in_july = daytime_traffic_in_july.groupby('year').mean(numeric_only=True)
fig = px.line(daytime_traffic_in_july,y=daytime_traffic_in_july['traffic_volume'],x=daytime_traffic_in_july.index, width=600,title="Average July Traffic by Year")
fig.update_traces(textposition="bottom right")
fig.show()
Adding a month column and plotting the mean traffic by month showed a few things:
Lower traffic in the early months of the year (January–March) and the last two (November–December)
Most months show higher volume — except for July, which dips unexpectedly
Digging deeper, I found that July 2016 had an especially low average, possibly due to major road construction, which aligns with reported I-94 lane closures that year.
Lane Closure Article: I-696 closure, I-96/US-23 bridge work, I-94 lane closures
📆 By Day of the Week¶
daytime_traffic['dayofweek'] = daytime_traffic['date_time'].dt.dayofweek
traffic_by_dayofweek = daytime_traffic.groupby('dayofweek').mean(numeric_only=True)
fig = px.line(traffic_by_dayofweek,y=traffic_by_dayofweek['traffic_volume'],x=traffic_by_dayofweek.index, width=600,title="Traffic Each Day of The Week")
fig.update_traces(textposition="bottom right")
fig.show()
Grouping by dayofweek (where Monday = 0), I found:
Weekdays (Monday to Friday) consistently have higher traffic
Weekends show a clear drop in average volume
This makes sense given work commute patterns.
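Because `dt.dayofweek` returns integers, the grouped result is easier to read once the index is mapped back to day names. The snippet below is a minimal sketch of that step; `mean_by_weekday` and `DAY_NAMES` are hypothetical names, and the sample rows are taken from the tables earlier in the notebook.

```python
import pandas as pd

# pandas encodes Monday as 0 and Sunday as 6
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def mean_by_weekday(df, time_col="date_time", value_col="traffic_volume"):
    """Mean of `value_col` per weekday, indexed by day name in calendar order."""
    days = pd.to_datetime(df[time_col]).dt.dayofweek
    means = df.groupby(days)[value_col].mean()
    means.index = [DAY_NAMES[i] for i in means.index]
    return means

# Rows copied from the head(), iloc, and tail() outputs above
sample = pd.DataFrame({
    "date_time": ["2012-10-02 09:00:00", "2012-10-02 10:00:00",
                  "2012-10-10 06:00:00", "2018-09-30 19:00:00"],
    "traffic_volume": [5545, 4516, 5875, 3543],
})

weekday_means = mean_by_weekday(sample)
print(weekday_means)
```

Grouping on the integer first and renaming afterwards keeps the days in calendar order, which grouping directly on day names would not.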
⏰ By Hour (Weekdays vs. Weekends)¶
To avoid weekend bias, I separated business days from weekends:
# Extract hour
daytime_traffic['hour'] = daytime_traffic['date_time'].dt.hour
# Separate weekdays (Mon–Fri) and weekends (Sat–Sun)
business_days = daytime_traffic[daytime_traffic['dayofweek'] <= 4]
weekend = daytime_traffic[daytime_traffic['dayofweek'] >= 5]
# Group by hour
by_hour_business = business_days.groupby('hour').mean(numeric_only=True)
by_hour_weekend = weekend.groupby('hour').mean(numeric_only=True)
# Create subplots
fig = make_subplots(rows=1, cols=2, column_widths=[0.5, 0.5])
# Line plots using Scatter with mode='lines'
traffic_volume_weekday_plot = go.Scatter(
    x=by_hour_business.index,
    y=by_hour_business['traffic_volume'],
    mode='lines',
    name='Weekday'
)
traffic_volume_weekend_plot = go.Scatter(
    x=by_hour_weekend.index,
    y=by_hour_weekend['traffic_volume'],
    mode='lines',
    name='Weekend'
)
# Add traces
fig.add_trace(traffic_volume_weekday_plot, row=1, col=1)
fig.add_trace(traffic_volume_weekend_plot, row=1, col=2)
# Update layout
fig.update_layout(
    title={'text': 'Traffic Volume by Hour: Weekday vs Weekend', 'x': 0.5},
    width=1000,
    height=400,
    xaxis1_title='Hour of Day',
    yaxis1_title='Traffic Volume',
    xaxis2_title='Hour of Day',
    yaxis2_title='Traffic Volume'
)
fig.show()
When plotted:
Business days have two clear peaks: around 07:00 and 16:00, aligning with morning and evening commutes
Weekends have flatter, lower traffic throughout the day
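Peaks like these can be read off the hourly means programmatically instead of by eye. The sketch below assumes a hypothetical helper, `local_peaks`, and uses synthetic hourly values that are not taken from the dataset; they only imitate a two-peak shape so the function has something to find.

```python
import pandas as pd

def local_peaks(hourly_means):
    """Return the hours whose mean volume exceeds both neighbouring hours.

    Hours at either end of the window are compared with their single neighbour.
    Hypothetical helper for illustration.
    """
    s = hourly_means.sort_index().astype(float)
    neg_inf = float("-inf")
    higher_than_prev = s > s.shift(1, fill_value=neg_inf)
    higher_than_next = s > s.shift(-1, fill_value=neg_inf)
    return s[higher_than_prev & higher_than_next].index.tolist()

# Synthetic values for hours 7..18, NOT from the dataset
synthetic = pd.Series(
    [6000, 5500, 4900, 4400, 4600, 4800, 4850, 5100, 5600, 6200, 5800, 4400],
    index=range(7, 19),
)

peaks = local_peaks(synthetic)
print(peaks)
```

With the real data, `local_peaks(by_hour_business['traffic_volume'])` would return whichever hours stand above their neighbours. A local-maximum test is used instead of simply taking the two largest values, since those could be adjacent hours within the same rush period.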
🌦️ Weather and Traffic
daytime_traffic.columns
Index(['holiday', 'temp', 'rain_1h', 'snow_1h', 'clouds_all', 'weather_main',
'weather_description', 'date_time', 'traffic_volume', 'month', 'year',
'dayofweek', 'hour'],
dtype='object')
daytime_traffic.info()
<class 'pandas.core.frame.DataFrame'>
Index: 23877 entries, 0 to 48198
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   holiday              0 non-null      object
 1   temp                 23877 non-null  float64
 2   rain_1h              23877 non-null  float64
 3   snow_1h              23877 non-null  float64
 4   clouds_all           23877 non-null  int64
 5   weather_main         23877 non-null  object
 6   weather_description  23877 non-null  object
 7   date_time            23877 non-null  datetime64[ns]
 8   traffic_volume       23877 non-null  int64
 9   month                23877 non-null  int32
 10  year                 23877 non-null  int32
 11  dayofweek            23877 non-null  int32
 12  hour                 23877 non-null  int32
dtypes: datetime64[ns](1), float64(3), int32(4), int64(2), object(3)
memory usage: 2.2+ MB
daytime_traffic[['temp', 'rain_1h', 'snow_1h', 'clouds_all', 'date_time', 'traffic_volume', 'month', 'year',
'dayofweek', 'hour']].corr()['traffic_volume']
temp              0.128317
rain_1h           0.003697
snow_1h           0.001265
clouds_all       -0.032932
date_time        -0.007153
traffic_volume    1.000000
month            -0.022337
year             -0.003557
dayofweek        -0.416453
hour              0.172704
Name: traffic_volume, dtype: float64
# Create subplots
fig = make_subplots(rows=1, cols=2, column_widths=[0.5, 0.5])
# Left: full temperature range, outliers visible
fig.add_trace(
    go.Scatter(
        x=daytime_traffic['traffic_volume'],
        y=daytime_traffic['temp'],
        mode='markers',
        name='Outlier Present'
    ),
    row=1, col=1
)
# Right: same data; the y-axis range set below (230 to 320) hides the outliers
fig.add_trace(
    go.Scatter(
        x=daytime_traffic['traffic_volume'],
        y=daytime_traffic['temp'],
        mode='markers',
        name='Outlier Absent'
    ),
    row=1, col=2
)
# Update layout and axis titles
fig.update_layout(
    title={
        'text': 'Traffic Volume Vs Temperature',
        'x': 0.5,
        'xanchor': 'center'
    },
    width=1200,
    height=400,
)
# Shared axis labels
fig.update_xaxes(title_text="Traffic Volume", row=1, col=1)
fig.update_yaxes(title_text="Temperature", row=1, col=1)
fig.update_xaxes(title_text="Traffic Volume", row=1, col=2)
fig.update_yaxes(title_text="Temperature", row=1, col=2, range=[230, 320]) # Outlier Absent
fig.show()
The Traffic Volume Vs Temperature graphs show that temperature is not a suitable indicator of traffic volume: its correlation with traffic_volume is only about 0.13.
It is best to explore the other weather-related columns.
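The right-hand panel only hides the outliers visually, so the correlation reported earlier still includes them. The sketch below shows one way to recompute it after dropping implausible temperatures. It rests on several assumptions: `temp_volume_corr` is a hypothetical helper, the 230 to 320 bounds are borrowed from the plot's y-axis range, the temperatures are taken to be in Kelvin, and the 0.0 reading in the sample is an invented outlier added for illustration.

```python
import pandas as pd

def temp_volume_corr(df, low=230.0, high=320.0):
    """Pearson correlation between temp and traffic_volume,
    using only rows whose temperature lies within [low, high]."""
    kept = df[df["temp"].between(low, high)]
    return kept["temp"].corr(kept["traffic_volume"])

# First five rows from the head() output, plus one invented outlier (temp = 0.0)
sample = pd.DataFrame({
    "temp": [288.28, 289.36, 289.58, 290.13, 291.14, 0.0],
    "traffic_volume": [5545, 4516, 4767, 5026, 4918, 5000],
})

filtered_r = temp_volume_corr(sample)
unfiltered_r = sample["temp"].corr(sample["traffic_volume"])
print(filtered_r, unfiltered_r)
```

Comparing the two values on `daytime_traffic` would show whether the weak correlation is a property of the data or an effect of a handful of bad readings.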
Weather Types¶
# Group and aggregate the data
traffic_by_weather_main = daytime_traffic.groupby('weather_main').mean(numeric_only=True)
# Create a horizontal bar chart
fig = go.Figure(
    go.Bar(
        x=traffic_by_weather_main['traffic_volume'],
        y=traffic_by_weather_main.index,
        orientation='h'
    )
)
# Update layout to match Matplotlib's style
fig.update_layout(
    title='Traffic Volume Vs Weather Type',
    xaxis_title='Traffic Volume',
    yaxis_title='Weather Type',
    height=500,
    width=700
)
fig.show()
traffic_by_weather_main.describe()
| | temp | rain_1h | snow_1h | clouds_all | traffic_volume | month | year | dayofweek | hour |
|---|---|---|---|---|---|---|---|---|---|
| count | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 |
| mean | 283.735656 | 0.696040 | 0.000390 | 64.851857 | 4611.190225 | 6.662984 | 2015.723636 | 2.778513 | 12.377507 |
| std | 8.512431 | 1.170541 | 0.000648 | 22.809767 | 212.710591 | 0.378205 | 0.274677 | 0.317491 | 0.983460 |
| min | 267.984505 | 0.000000 | 0.000000 | 1.670265 | 4211.000000 | 5.832134 | 2015.321420 | 2.000000 | 10.325967 |
| 25% | 278.500233 | 0.027026 | 0.000000 | 63.333774 | 4480.452258 | 6.441921 | 2015.542564 | 2.752270 | 12.230705 |
| 50% | 283.812078 | 0.170804 | 0.000000 | 74.961435 | 4623.976475 | 6.734285 | 2015.619429 | 2.895102 | 12.467626 |
| 75% | 289.747717 | 0.949167 | 0.000559 | 75.527076 | 4796.992361 | 6.916667 | 2015.899443 | 2.944984 | 12.802994 |
| max | 296.730000 | 3.972943 | 0.001768 | 84.704417 | 4865.415996 | 7.108647 | 2016.261641 | 3.138928 | 14.000000 |
Weather Description¶
# Group by weather description and compute average traffic volume
traffic_by_weather_description = daytime_traffic.groupby('weather_description').mean(numeric_only=True)
# Optional: sort for better readability
traffic_by_weather_description = traffic_by_weather_description.sort_values('traffic_volume', ascending=True)
# Create horizontal bar chart
fig = go.Figure(
    go.Bar(
        x=traffic_by_weather_description['traffic_volume'],
        y=traffic_by_weather_description.index,
        orientation='h'
    )
)
# Update layout; this chart groups by weather_description, so label it accordingly
fig.update_layout(
    title='Traffic Volume Vs Weather Description',
    xaxis_title='Traffic Volume',
    yaxis_title='Weather Description',
    height=800,
    width=800,
    margin=dict(l=150)  # Add more left margin if labels are long
)
fig.show()
The dataset includes several weather-related columns:
temp
rain_1h
snow_1h
clouds_all
weather_main
weather_description
Looking at average traffic volume by weather description, I found three weather types that stood out — all with over 5,000 vehicles/hour on average:
Shower snow
Light rain and snow
Proximity thunderstorm with drizzle
At first this seemed odd — why would bad weather increase traffic? My hypothesis is that during unpleasant but not extreme weather, people are more likely to drive instead of biking or walking, which results in a bump in car usage.
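Before reading much into a high average, it helps to see how many hourly rows sit behind each weather description, since a category with few observations can top the ranking by chance. The sketch below is a hedged illustration: `volume_with_counts` and its `min_rows` threshold are hypothetical, and the sample consists of the five rows from the `head()` output rather than the full daytime data.

```python
import pandas as pd

def volume_with_counts(df, by="weather_description", min_rows=1):
    """Mean traffic volume and row count per category, highest mean first.

    Categories with fewer than `min_rows` observations are dropped.
    Hypothetical helper for illustration.
    """
    stats = df.groupby(by)["traffic_volume"].agg(["mean", "count"])
    return stats[stats["count"] >= min_rows].sort_values("mean", ascending=False)

# First five rows from the head() output
sample = pd.DataFrame({
    "weather_description": ["scattered clouds", "broken clouds", "overcast clouds",
                            "overcast clouds", "broken clouds"],
    "traffic_volume": [5545, 4516, 4767, 5026, 4918],
})

all_stats = volume_with_counts(sample)
robust = volume_with_counts(sample, min_rows=2)
print(all_stats)
print(robust)
```

In this small sample the top-ranked description is backed by a single row and disappears once a minimum count is required. Running the same check on `daytime_traffic` would show whether the three descriptions above 5,000 vehicles/hour are supported by enough observations to justify the hypothesis.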
✅ Final Summary¶
This analysis revealed two main types of indicators for high traffic:
🔹 Time-Based Indicators¶
Heavier traffic during warmer months (March–October)
Weekdays are busier than weekends
Rush hour peaks at around 07:00 and 16:00
🔹 Weather-Based Indicators¶
Some moderately bad weather conditions lead to higher traffic volumes
Possibly because people prefer using personal vehicles during such weather
This project helped me practice:
Working with time series data
Cleaning and transforming datetime features
Visualizing distributions and trends
Formulating and testing data-driven hypotheses
I'm excited to keep building on these skills as I explore more complex datasets and start incorporating machine learning into my projects!